Relevance Feedback and Query Expansion for Searching the Web: A Model for Searching a Digital Library

Authors

  • Alan F. Smeaton
  • Francis Crimmins
Abstract

A fully operational, large-scale digital library is likely to be based on a distributed architecture, and because of this it is likely that a number of independent search engines will be used to index different, overlapping portions of the library's entire contents. In any case, different media (text, audio, image, etc.) will be indexed for retrieval by different search engines, so techniques which provide a coherent and unified search over a suite of underlying, independent search engines are likely to be an important part of navigating in a digital library. In this paper we present an architecture and a system for searching the world's largest DL, the world wide web. What makes our system novel is that we use a suite of underlying web search engines to do the bulk of the work, while our system orchestrates them in parallel to provide a higher level of information retrieval functionality. Thus it is our meta search engine, and not the underlying direct search engines, that provides the relevance feedback and query expansion options for the user. The paper presents the design and architecture of the system which has been implemented, describes an initial version which has been operational for almost a year, and outlines the operation of the advanced version.

1. Information Retrieval in a Digital Library

Much of the push for the development and use of multimedia information has been on the development of the technology, and so we have seen much work in networking, compression, transmission, storage, presentation and delivery. In order to make effective use of any kind of electronic information, as found in a digital library (DL), the organisation and manipulation of information by content is a crucial component. Thus information retrieval (IR) is a key technology in the development of DLs. The development of information retrieval techniques over the last few decades has been predicated upon their deployment in centralised systems, and though distributed IR systems such as WAIS (wide area information servers) have been developed, these have had no more than a minor impact on the field. Even the world wide web (WWW), the largest distributed collection in the world, is searched by the vast majority of users using centralised IR indexes.

A fully operational, large-scale DL is likely to be based on a distributed architecture, and because of this it is likely that a number of independent search engines will be used to index different but overlapping portions of the entire contents of the library. In any case, different media such as text, audio and image will be indexed for retrieval using different indexing techniques and by different search engines, so any retrieval or navigation technique which provides a coherent and unified search over a suite of underlying, independent search engines is likely to be an important part of navigating in a digital library. It is well known in information retrieval that techniques such as relevance feedback and query expansion improve retrieval effectiveness over straightforward keyword weighting and matching. Implementing such techniques over a suite of underlying search engines is thus desirable from a user's point of view, as it allows the individual search engines to remain relatively straightforward and uncomplicated while still delivering advanced search options to the user.
Given that this approach is a possible paradigm for searching in a digital library, and thus is important for the DL community, in this paper we present a technique, and the design of an implementation, for searching the WWW based on broadcasting a user's search to a number of conventional search engines and combining the results into one overall ranked list. Searching the WWW using such IR techniques is a noble task in itself, but it is also an appropriate model for the kind of IR system we outlined above, as there exist a number of search engines which provide straightforward keyword weighting and term matching over overlapping portions of the entire web. Our approach to retrieval is effectively a meta search engine; the concept, and other systems which do this for the web, are described in the next section of this paper. We also allow a user to feed back to our system which URLs are relevant to the query and which are not, and from this information we can generate a ranked list of search terms which the user can choose to add to the search, or can have the system add them all.

The general approaches taken to searching the web are examined in section 2, and meta search engines for web searching are presented in section 3. The algorithm we use to deliver our information retrieval technique, and the architecture of our system, are described in section 4, which also includes some screendumps of the user interface. In section 5 we report on the status of our implementation, and in section 6 we give an analysis of our approach. A final section presents some of our plans for extending this work and some conclusions.

2. Searching the World Wide Web

Soon after the world wide web (WWW) was launched some years ago, several groups realised independently that effective access to information on the web could not be provided by allowing users to follow hypertext links serendipitously, or by attempting to maintain a classification or directory of the web's content. This led to the development of systems whose function was to constantly "crawl" the web, seeking new pages or updates of old pages and, having discovered a new or recently updated page, to download and index that page into a local database or index files. The local index could then be made available to the internet community for searching. Such web crawling programs provide the source of the input documents being indexed, and when a user queries a search engine such as Lycos the "documents" returned are actually pointers to the original documents on some remote WWW server. As the growth of the web has taken off, a number of other search engines have joined Lycos in their respective attempts to index the entire WWW; these include AltaVista, Excite, InfoSeek and many others.

All of these search engines have many things in common, including poor support for the concept of a search "session" between user and system. In practice, when we search any system in response to an information need we tend to have shifting or evolving requirements. These arise from the natural evolution of our needs: as we see some documents on one sub-topic of our query we may feel that aspect to be satisfied, and we may wish to concentrate on some other sub-topic of our query.
In addition, as we examine the content of retrieved documents we expand our vocabulary of the domain of our search: we get to know more of what we are looking for and we discover good search terms, and thus we may be in a position to expand our original query with additional search terms and/or to assign specific, user-determined weights to search terms based on documents seen so far.

Query expansion, as described above, is not present in most conventional search engines, with the exception of AltaVista LiveTopics, yet it is known in experimental information retrieval to be an effective aid to the retrieval task [Smeaton & van Rijsbergen 83]. Relevance feedback is the concept whereby a user's judgement as to the relevance or otherwise of a retrieved document, relative to the query, is fed back to the search system so that the system may use this information to improve the remainder of the search. This information could be used to automatically add extra search terms or to assign new weights to existing search terms. The usefulness of this as an IR technique has been known for decades, and one of the most successful techniques for doing it was published over 20 years ago [Robertson & Sparck Jones 76] (the weighting formula is recalled at the end of this section). Techniques such as query expansion and relevance feedback have been developed over a long period and have been demonstrated to improve retrieval effectiveness on large, multi-gigabyte text collections such as TREC [Harman 96], but they have not appeared in global web search engines. LiveTopics from AltaVista (1) comes close, but the query expansion in this case is based on top-ranked documents and is not limited to relevant ones as judged by the user. MUSCAT (2) also provides similar functions, though it is used for searching intranets or some geographic sub-portion of the global web.

1 http://www.altavista.digital.com/
2 http://www.muscat.co.uk/

What conventional web search engines do, and do well, is attempt to index as large a portion of the web as possible and to keep those indexes as current as possible. They generally provide term-weighted retrieval, returning a ranked list of URLs, and they do this efficiently. Some provide functionality beyond a simple list of words as an input query, for example by allowing query phrases. In order to attract customers, and in turn advertising revenue, web search engines compete on the size of their indexes or the portion of the web they claim to have indexed. Thus the emphasis has been on web coverage rather than on retrieval effectiveness, and efficiency of searching does not seem to be a problem. Despite the large engineering efforts which go into delivering web search technology, users are easily dissatisfied with the service, in particular with the number of non-relevant URLs retrieved. Clearly, by concentrating on coverage rather than effectiveness, web search engine developers did the right thing at a time of rapid web growth: a search engine which searched only a small portion of the web effectively would probably attract few customers, whereas an engine which delivered a search on all of the web, but put the burden of sifting through retrieved URLs onto the user, would attract more custom. At this point, however, we need to see web search engine functionality enhanced, and some of the more advanced IR techniques which are known to work incorporated into these services.
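For concreteness, the classic relevance weighting formula of [Robertson & Sparck Jones 76] referred to above can be written as follows. This is the standard formulation from the IR literature, not reproduced from this paper, and it uses the same four statistics that reappear in section 4:

$$ w(t) \;=\; \log \frac{(r + 0.5)\,(N - n - R + r + 0.5)}{(n - r + 0.5)\,(R - r + 0.5)} $$

where N is the total number of documents in the collection, n is the number of those documents indexed by term t, R is the number of documents judged relevant so far, and r is the number of those relevant documents indexed by t.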
3. WWW Meta Search Engines

The idea of using one or more IR search engines as the basis underlying a more sophisticated or elaborate search functionality is not new. In 1982 Morrissey [Morrissey 82] described an intelligent terminal which took a user's query, expressed as a set of keywords, and implemented term weighting and document ranking by broadcasting numerous independent searches to an underlying boolean IR system. This provided an IR functionality (weighted search terms and document ranking) on top of a less sophisticated boolean search engine. In the Harvest system [Bowman et al. 95], brokers provide the indexing and query interface to the gathered information. They achieve this by requesting information from information gatherers and from other brokers; this layered approach allows efficient use of network bandwidth and resources. The GLOSS service [Gravano et al. 94] suggests potentially good databases to search, based on word-frequency information for each database. A user's query is sent to the GLOSS server, which then evaluates it at the chosen databases.

A similar approach to using one search system on top of another has also been developed for searching the web with the emergence of meta search engines, to which our work is comparable. Examples of this type are Highway 61, Inference Find!, Mamma, MetaCrawler, ProFusion and SavvySearch (3). These systems all operate in essentially the same way, querying some underlying WWW search engines in parallel in order to answer user queries. Where they differ is in the processing they perform on the results returned by the WWW search engines before presenting them to the user. Highway 61, Mamma, MetaCrawler and ProFusion combine their results by using a data fusion technique based on document score, i.e. they sum the scores given to a document by the different engines. MetaCrawler and ProFusion offer broken-link detection as well, although this results in an increase in query time. Inference Find! clusters the documents returned by the search engines into groups based on their location, i.e. the WWW site they are at; MetaCrawler also offers this as an alternative to ranking based on score. SavvySearch knows about a large number of underlying engines and concentrates on selecting the subset of these search engines to route a user's query to.

3 The URLs are, respectively:
http://www.highway61.com
http://m5.inference.com/ifind
http://www.mamma.com
http://www.metacrawler.com
http://www.designlab.ukans.edu/profusion
http://williams.cs.colostate.edu:1969

One of the differences between our work and these other meta search services is that the others all use HTML forms as their user interface, which limits the functionality and interaction that can be offered to users. As we will see later in this paper, our client/server architecture and its implementation in Java allow us to offer an improved and more interactive interface to the user and, most importantly, give us more scope for development. Our system also incorporates more effective IR, namely relevance feedback and query expansion. The meta search engines mentioned above should not be confused with the so-called all-in-one pages such as All-In-One, CUSI, Find-It! and Search.com (4). These are basically a compilation of the form interfaces of different search tools found on the web. They cover a number of general and specialised engines, divided into categories, e.g. web, software, people, technical reports, etc. There is no parallelism or combination of results involved, as they simply redirect the browser to the relevant engine with the appropriate query.

4 The URLs for these are:
http://www.albany.net/allinone/
http://pubweb.nexor.co.uk/public/cusi/cusi.html
http://www.iTools.com/find-it/findit.html
http://www.search.com
4. Our Meta Search Algorithm and Architecture for Retrieval

Our system uses a client/server architecture, with the client being a Java applet running on the user's machine and the server program, also written in Java, running on a Sun Ultra Sparc 2. The client is lightweight in terms of the computational processing it performs, in order to keep it operational on low-spec machines and the emerging network computers. An architecture such as ours is termed a "knowledge server" by Eriksson [Eriksson 96].

Our algorithm for retrieving information from the web begins by inviting the user to input a set of keywords or search terms into the client applet. These may be individual words or phrases delimited by inverted commas. When the user has input a query and pressed the "Run Query" button, the query is passed back to our server, from where it is broadcast in parallel to 6 web search engines: AltaVista, Excite, Infoseek, Lycos, OpenText and WebCrawler. In passing on a user's search to a web search engine we request each system to return its top 100 ranked URLs, because of the very poor overlap we have observed in the top-ranked document lists from different web search engines [Smeaton & Crimmins 97]. Of the web search engines we interrogate, only WebCrawler supports a direct request for the top 100 ranked URLs. AltaVista, Excite, InfoSeek and OpenText return only 10 URLs per search request, so we break our user's query into 10 individual queries for each of these engines, requesting the top 10 URLs, URLs ranked 11 to 20, URLs ranked 21 to 30, and so on. Lycos can be interrogated to retrieve a maximum of 40 URLs with one search, so this is handled by breaking into 3 separate threads to get the top 100. This yields a total of 44 parallel threads, or requests to search engines.

After a time-out period (currently set to 25 seconds) we perform a data fusion operation on the URL lists returned from the search engines up to that point. This data fusion is performed on our server machine and is based on rank position rather than retrieval status value (RSV) or URL score, as not all search engines return scores, and for those that do, the range of scores is not consistent across engines; some search engines have no upper limit for a URL's score. The ranked results are stored in a hash table, with the URLs being used to generate the hash code. Duplicate objects have their ranks summed, and objects are penalised if they have not been retrieved by a particular search engine. The table is then sorted into ascending order based on rank, and the data fusion process is complete. More details on this process are available in [Smeaton & Crimmins 97].
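The following is a minimal sketch, in Java, of the broadcast-and-fuse cycle just described. The Engine interface, the class and method names, and the penalty of listSize + 1 for a URL an engine did not return are our assumptions for illustration; the per-engine HTTP paging (10 or 40 URLs per request) and the parsing of result pages are abstracted away, and the real system also re-runs the fusion as late responses arrive.

```java
import java.util.*;
import java.util.concurrent.*;

// Hedged sketch of the broadcast-and-fuse cycle; names are illustrative.
public class MetaSearch {

    interface Engine {
        // Return up to 'count' ranked URLs for the query, issuing as many
        // paged HTTP requests as this particular engine requires.
        List<String> topUrls(String query, int count) throws Exception;
    }

    static final int LIST_SIZE = 100;          // top-100 requested per engine
    static final long TIMEOUT_SECONDS = 25;    // first fusion time-out

    /** Query all engines in parallel; lists not back within the time-out
     *  are simply absent from this fusion pass (they join a later update). */
    public static List<List<String>> broadcast(String query, List<Engine> engines)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(engines.size());
        List<Callable<List<String>>> tasks = new ArrayList<>();
        for (Engine e : engines) {
            tasks.add(() -> e.topUrls(query, LIST_SIZE));
        }
        List<Future<List<String>>> futures =
            pool.invokeAll(tasks, TIMEOUT_SECONDS, TimeUnit.SECONDS);
        pool.shutdown();
        List<List<String>> results = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            try {
                results.add(f.get());          // completed within the time-out
            } catch (CancellationException | ExecutionException ex) {
                // engine timed out or failed; skip it for this fusion pass
            }
        }
        return results;
    }

    /** Rank-based data fusion: sum 1-based rank positions per URL and
     *  penalise URLs an engine did not return (assumed LIST_SIZE + 1). */
    public static List<String> fuse(List<List<String>> engineResults) {
        Map<String, Integer> fusedRank = new HashMap<>();
        for (List<String> list : engineResults) {
            for (int i = 0; i < list.size(); i++) {
                fusedRank.merge(list.get(i), i + 1, Integer::sum);
            }
        }
        for (List<String> list : engineResults) {
            for (String url : fusedRank.keySet()) {
                if (!list.contains(url)) {
                    fusedRank.merge(url, LIST_SIZE + 1, Integer::sum);
                }
            }
        }
        List<String> fused = new ArrayList<>(fusedRank.keySet());
        fused.sort(Comparator.comparingInt(fusedRank::get)); // ascending = best first
        return fused;
    }
}
```

Because the fusion uses only rank positions, it is indifferent to the incompatible score ranges of the underlying engines, which is the design motivation given above.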
We take the fused ranking of URLs and send it back to the user's client applet for display. Figure 1 shows a screendump of the client applet where the user has input a single-term query, "Orbital", which has been processed; the top URLs in the fused ranking from the search engines are displayed. The user can double-click any of these, or select one and press the "Load URL" button, to have that web page retrieved from the web and displayed in the user's WWW browser. As responses from search engines which had not returned before the first time-out period come back to our server machine, an updated URL ranking is generated, and this updated overall ranking is periodically propagated back to the user's applet window.

Figure 1: Screendump of a user's search having retrieved an initial ranking of URLs

As a user selects URLs on this (scrollable) list, the full URL is displayed on the status line showing "Displaying results" in Figure 1. On viewing a URL, a user may mark URLs in the retrieved list as being relevant (using that button), and this is shown by a colour change for the relevant URLs in the applet's results listing (not shown in the figures). Having viewed some URLs and marked some of them as relevant, a user may invoke the "Expand Query" command by pressing that button. This sends the list of URLs marked as relevant back to our server process, which then retrieves those full pages, in parallel, from the web (the earlier retrieval of these pages had been done for display on the user's machine, not on our server, and hence they are not cached for us). Pages not returned from the web within a time-out period are discarded from further processing. The text of the retrieved URLs is then analysed by removing HTML tags and stopwords from each page and stemming the remaining text using Porter's word stemmer [Porter 80]. From this we extract a list of candidate search terms, which are word stems. Each candidate search term which is not an original query term, taken from known relevant URLs, is then scored using a search-term ranking formula.

In [Efthimiadis 95], 8 different formulae for ranking candidate search terms for query expansion were evaluated using an operational information retrieval system. The evaluation was based on how closely each formula's ranking matched the choice of search terms to add as made by a user; thus the best of the formulae ranked candidate search terms in such a way that the highest-ranked ones were the ones chosen by a user in an operational setting. Of the 8 formulae tried, the successful ones (the Porter, emim and wpq formulae) all used the following parameters when scoring an individual search term t: N, the total number of documents in the collection; n, the number of those documents indexed by term t; R, the size of the sample of relevant documents identified so far by the user; and r, the number of those relevant documents which are indexed by term t. For the case of searching the web, where both N and n are unknown, these parameters are difficult to estimate, so part of our work involves determining the most appropriate formula by which to rank candidate search terms. It would seem that the simplest one, r/R, may be the best to use. It is certainly the easiest to implement, and our work will reveal whether or not it is appropriate.

Once the candidate search terms have been ranked, the top ones scoring above a threshold are sent back to the user's applet for display, not as word stems but as the set of word-form occurrences which reduce to that word stem. This causes a new pane to open on the user's applet, as shown in Figure 2.
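As an illustration of the simplest formula above, the following sketch scores each candidate stem by r/R, the fraction of the user-marked relevant pages containing it. The class and method names are ours, and the crude tokeniser stands in for the HTML stripping, stopword removal and Porter stemming described in the text.

```java
import java.util.*;

// Hedged sketch: score candidate expansion terms by r/R, where R is the
// number of pages the user marked relevant and r is the number of those
// pages containing the term.
public class TermRanker {

    public static List<Map.Entry<String, Double>> rank(
            List<String> relevantPageTexts,   // page texts, already HTML-stripped
            Set<String> originalQueryTerms) { // original terms are excluded
        int bigR = relevantPageTexts.size();
        Map<String, Integer> r = new HashMap<>();   // term -> r
        for (String text : relevantPageTexts) {
            // Crude tokeniser; the real system also removes stopwords and
            // applies Porter's stemmer [Porter 80] before counting.
            Set<String> terms = new HashSet<>(
                Arrays.asList(text.toLowerCase().split("\\W+")));
            terms.remove("");                       // artefact of split()
            terms.removeAll(originalQueryTerms);    // not original query terms
            for (String t : terms) {
                r.merge(t, 1, Integer::sum);        // term seen in one more page
            }
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>();
        for (Map.Entry<String, Integer> e : r.entrySet()) {
            ranked.add(Map.entry(e.getKey(), e.getValue() / (double) bigR));
        }
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return ranked;  // highest r/R first; a threshold cut-off would follow
    }
}
```

Estimating the wpq or emim formulae would additionally need N and n, which, as noted above, are unknown for the web; this is precisely why the r/R score is attractive here.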


Similar resources

Query expansion based on relevance feedback and latent semantic analysis

Web search engines are among the most popular tools on the Internet and are widely used by expert and novice users alike. Constructing an adequate query which represents the best specification of users’ information need to the search engine is an important concern of web users. Query expansion is a way to reduce this concern and increase user satisfaction. In this paper, a new method of query expa...


UBML participation to CLEF eHealth IR challenge 2015: Task 2

This paper describes the participation of UBML, a team composed of members of the Department of Computer Science, University of Botswana, in the biomedical information retrieval challenge proposed in the framework of CLEF eHealth 2015 Task 2. For this first participation, we are evaluating the effectiveness of two different query expansion strategies when searching for health related content ...


Thesaurus-Based Feedback to Support Mixed Search and Browsing Environments

We propose and evaluate a query expansion mechanism that supports searching and browsing in collections of annotated documents. Based on generative language models, our feedback mechanism uses document-level annotations to bias the generation of expansion terms and to generate browsing suggestions in the form of concepts selected from a controlled vocabulary (as typically used in digital librar...


Concept Based Intelligent Information Retrieval within Digital Library

A digital library is a type of information retrieval (IR) system. The existing information retrieval methodologies generally have problems with keyword searching. We proposed a model to solve the problem by using a concept-based approach (ontology) and a metadata case base. This model consists of identifying domain concepts in the user’s query and applying expansion to them. The system aims at contributi...


Analysis of users’ query reformulation behavior in Web with regard to Wholistic/analytic cognitive styles, Web experience, and search task type

Background and Aim: The basic aim of the present study is to investigate users’ query reformulation behavior with regard to wholistic-analytic cognitive styles, search task type, and experience variables in using the Web. Method: This study is an applied research using survey method. A total of 321 search queries were submitted by 44 users. Data collection tools were Riding’s Cognitive Style A...


Intelligent Information Retrieval within Digital Library Using Domain Ontology

A digital library is a type of information retrieval (IR) system. The existing information retrieval methodologies generally have problems with keyword searching. We proposed a model to solve the problem by using a concept-based approach (ontology) and a metadata case base. This model consists of identifying domain concepts in the user’s query and applying expansion to them. The system aims at contributi...



Journal title:

Volume   Issue

Pages  -

Publication date: 1997